ABSTRACT

Traditional statistical speech recognition systems typically make strong assumptions about the independence of observation frames and generally do not make use of segmental information. In contrast, when the segmentation is known, existing classifiers can readily accommodate segmental information in the decision process. We describe an approach to connected word recognition that allows the use of segmental information through an explicit decomposition of the recognition criterion into classification and segmentation scoring. Preliminary experiments are presented, demonstrating that the proposed framework, using fixed length sequences of cepstral feature vectors for classification of individual phonemes, performs comparably to more traditional recognition approaches that use the entire observation sequence. We expect that a performance gain can be obtained using this structure with additional, more general features.

1.  INTRODUCTION 

Although hidden-Markov-model (HMM) based speech recognition systems have achieved very high performance, it may be possible to improve on their performance by addressing the known deficits of the HMM. Perhaps the most obvious weaknesses of the model are the reliance on frame-based feature extraction and the assumption of conditional independence of these features given an underlying state sequence. The assumption of independence disagrees with what is known of the actual speech signal, and when this framework is accepted, it is difficult to incorporate potentially useful measurements made across an entire segment of speech. Much of the linguistic knowledge of acoustic-phonetic properties of speech is most naturally expressed in such segmental measurements, and the inability to use such measurements may represent a significant loss in potential performance.

In an attempt to address this issue, a number of models have been proposed that use segmental features as the basis of recognition. Although these models allow the use of segmental measurements, they have not yet achieved significant performance gains over HMMs because of difficulties associated with modeling a variable length observation with segmental features. Many of these models represent the segmental characteristics as a fixed-dimensional vector of features derived from the variable-length observation sequence. Although such features may work quite well for classification of individual units, such as phonemes or syllables, it is less obvious how to use fixed-length features to score a sequence of these units where the number and location of the units is not known. For example, simply taking the product of independent phoneme classification probabilities using fixed length measurements is inadequate. If this is done, the total number of observations used for an utterance is $F \times N$, where $F$ is the fixed number of features per segment and $N$ is the number of phonemes in the hypothesized sentence. As a result, the scores for hypotheses with different numbers of phonemes will effectively be computed over different dimensional probability spaces, and as such, will not be comparable. In particular, long segments will have lower costs per frame than short segments.
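The dimensionality mismatch can be made concrete with a small sketch. All values below are hypothetical: F = 5 and the per-segment probability of 0.5 are chosen only for illustration, and `naive_log_score` and `observations_used` are names introduced here rather than parts of any system described in this paper.

```python
import math

def naive_log_score(segment_log_probs):
    """Log of the product of independent per-segment classification
    probabilities (the naive sequence score discussed above)."""
    return sum(segment_log_probs)

def observations_used(features_per_segment, num_phonemes):
    """Total number of fixed-length measurements behind the naive score: F x N."""
    return features_per_segment * num_phonemes

F = 5  # hypothetical fixed number of feature vectors per segment

# Two hypothetical hypotheses for the SAME utterance, with different phoneme counts.
hyp_short = [math.log(0.5)] * 3  # N = 3 phonemes
hyp_long = [math.log(0.5)] * 6   # N = 6 phonemes

dims_short = observations_used(F, len(hyp_short))
dims_long = observations_used(F, len(hyp_long))

# The two scores are built from different numbers of measurements even though
# the underlying utterance (and its number of frames) is identical.
print(dims_short, dims_long)
print(naive_log_score(hyp_short), naive_log_score(hyp_long))
```

Because the number of terms in the score changes with N, the two hypotheses are not scored on a common footing, which is the problem a segmentation score is meant to correct.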

In this paper, we address the segment modeling problem using an approach that decomposes the recognition process into a segment classification problem and a segmentation scoring problem. The explicit use of a classification component allows the direct use of segmental measures as well as a variety of classification techniques that are not readily accommodated with other formulations. The segmentation score component effectively normalizes the scores of different length sequences, making them comparable.

[Report Documentation Page, Standard Form 298 (Rev. 8-98). Report date: 1992. Dates covered: 1992. Title: Recognition Using Classification and Segmentation Scoring. Performing organization: BBN Technologies, 10 Moulton Street, Cambridge, MA 02238. Distribution: approved for public release; distribution unlimited. Number of pages: 5. Security classification of report, abstract, and this page: unclassified.]

2.  CLASSIFICATION  AND  SEGMENTATION  SCORING

2.1.  General  Model

The goal of speech recognition systems is to find the most likely label sequence, $A = a_1, \ldots, a_N$, given a sequence of acoustic observations, $X$. For simplicity, we can restrict the problem to finding the label sequence, $A$, and segmentation, $S = s_1, \ldots, s_N$, that have the highest joint likelihood given the observations. (There is typically no explicit segmentation component in the formulation for HMMs; in this case, the underlying state sequence is analogous to the segmentation-label sequence.) The required optimization is then to find labels $A^*$ such that

$$A^* = \arg\max_{A,S}\, p(A, S \mid X) = \arg\max_{A,S}\, p(A, S, X). \tag{1}$$

The usual decomposition of this probability is

$$p(A, S, X) = p(X \mid A, S)\, p(S \mid A)\, p(A), \tag{2}$$

as is commonly used in HMMs and has been used in our previous segment modeling. However, we can consider an alternative decomposition:

$$p(A, S, X) = p(A \mid S, X)\, p(S, X).$$

In this case, the optimization problem has two components: a "classification probability," $p(A \mid S, X)$, and a "probability of segmentation," $p(S, X)$. We refer to this approach as classification-in-recognition (CIR).

The CIR approach has a number of potential advantages related to the use of a classification component. First, segmental features can be accommodated in this approach by constraining $p(A \mid X, S)$ to have the form $p(A \mid f(X), S)$, where $f(X)$ is some function of the original observations. The possibilities for this function include the complete observation sequence itself, as well as fixed dimensional segmental feature vectors computed from it. A second advantage is that a number of different classifiers can be used to compute the posterior probability, including neural networks and classification trees, as well as other approaches.

To simplify initial experiments, we have made the assumption that phoneme segments are generated independently. In this case (1) is rewritten as

$$A^* = \arg\max_{A,S} \prod_i p(a_i \mid X(s_i), s_i)\, p(s_i, X(s_i)),$$

where $a_i$ is one label of the sequence, $s_i$ is a single segment of the segmentation*, and $X(s_i)$ is the portion of the observation sequence corresponding to $s_i$. Segmental features are incorporated by constraining $p(a_i \mid X(s_i), s_i)$ to be of the form $p(a_i \mid f(X(s_i)), s_i)$, as mentioned above.

*If $s_i$ is defined as the start and end times of the segment, clearly consecutive $s_i$ are not independent. To avoid this problem, we think of $s_i$ as corresponding to the length of the segment.

There are a number of segment-based systems that take a classification approach to recognition [1, 2, 3]. With the exception of [2], however, these do not include an explicit computation of the segmentation probability. Our approach differs from [2] in the types of models used and in the method of obtaining the segmentation score. In [2], the classification and segmentation probabilities are estimated with separate multi-layer perceptrons.

2.2.  Classification  Component

The formulation described above is quite general, allowing the use of a number of different classification and segmentation components. The particular classifier used in the experiments described below is based on the Stochastic Segment Model (SSM) [4], an approach that uses segmental measurements in a statistical framework. This model represents the probability of a phoneme based on the joint statistics of an entire segment of speech. Several variants of the SSM have been developed since its introduction [5, 6], and recent work has shown this model to be comparable in performance to hidden-Markov model systems for the task of word recognition [7]. The use of the SSM for classification in the CIR formalism is described next.

Using the formalism of [4], $p(X(s_i) \mid s_i, a_i)$ is characterized as $p(f(X(s_i)) \mid s_i, a_i)$, where $f(\cdot)$ is a linear time warping transformation that maps variable length $X(s_i)$ to a fixed length sequence of vectors $Y = f(X(s_i))$. The specific model for $Y$ is multi-variate Gaussian, generally subject to some assumptions about the covariance structure to reduce the number of free parameters in the model. The posterior probability used in the classification work here is obtained from this distribution according to

$$p(a_i \mid f(X(s_i)), s_i) = \frac{p(f(X(s_i)) \mid a_i, s_i)\, p(a_i, s_i)}{\sum_{a_i} p(f(X(s_i)) \mid a_i, s_i)\, p(a_i, s_i)}.$$

There are more efficient methods for direct computation of the posterior distribution $p(a_i \mid f(X(s_i)), s_i)$, such as with tree-based classifiers or neural networks. However, the above formulation, which uses class-conditional densities of the observations, $p(f(X(s_i)) \mid a_i, s_i)$, has the advantage that we can directly compare the CIR approach to the traditional approach and therefore better understand the issues associated with using fixed-length measurements and the effect of the segmentation score. In addition, this approach allows us to take advantage of recent improvements to the SSM, such as the dynamical system model [6], at a potentially lower cost due to subsampling of observations.
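As a minimal sketch of how a posterior can be obtained from class-conditional densities by normalizing over classes, assume the per-class log-likelihoods and log priors for one segment are already available. The phone labels and numbers below are hypothetical, and `class_posteriors` is an illustrative name; the normalization is done in the log domain to avoid underflow.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def class_posteriors(log_likelihoods, log_priors):
    """Bayes' rule over a set of classes.

    log_likelihoods[a] stands for log p(f(X(s)) | a, s)
    log_priors[a]      stands for log p(a, s)
    Returns a dict mapping each class a to p(a | f(X(s)), s).
    """
    joint = {a: log_likelihoods[a] + log_priors[a] for a in log_likelihoods}
    log_norm = log_sum_exp(list(joint.values()))
    return {a: math.exp(v - log_norm) for a, v in joint.items()}

# Hypothetical values for a single segment and three phone classes.
log_lik = {"aa": -10.0, "iy": -12.0, "s": -20.0}
log_prior = {"aa": math.log(0.3), "iy": math.log(0.3), "s": math.log(0.4)}

post = class_posteriors(log_lik, log_prior)
best = max(post, key=post.get)
print(best, post)
```

A classifier that outputs posteriors directly, such as a neural network, would replace `class_posteriors` without changing the rest of the recognition criterion.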

2.3.  Segmentation  Component 

There are several possibilities for estimating the segmentation probability, and two fundamentally different approaches are explored here. First we note that we can estimate either $p(S \mid X)$ or $p(S, X)$ for the segmentation probability, leading to the two equivalent expressions in (1).

One method is to simply compute a mixture distribution of segment probabilities to find $p(s_i, X(s_i))$:

$$p(s_i, X(s_i)) = \sum_j p(s_i, X(s_i), c_j) = \sum_j p(X(s_i) \mid s_i, c_j)\, p(s_i, c_j), \tag{3}$$

where $\{c_j\}$ is a set of classes, such as linguistic classes or context-independent phones. In order to find the score for the complete sequence of observations, the terms in the summation in (3) are instances of the more traditional formulation of (2). This method uses the complete observation sequence, as in [4], to determine the segmentation probabilities, as opposed to the features used for classification, which may be substantially reduced from the original observations and may lack some cues to segment boundaries, such as transitional acoustic events.
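The mixture score of (3) can be sketched as a log-domain sum over classes. The sketch below also checks one consequence of the decomposition: if the classifier and the segmentation score were built from the same densities over the same class set, the product of posterior and segmentation score would reduce to the joint term of the traditional formulation. The class names and numbers are hypothetical. In the experiments reported here the classifier uses fixed-length features while the segmentation score uses the complete observation sequence, so the two systems are not identical in practice.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def segment_log_marginal(log_likelihoods, log_priors):
    """Equation (3): log p(s, X(s)) = log sum_j p(X(s) | s, c_j) p(s, c_j)."""
    return log_sum_exp([log_likelihoods[c] + log_priors[c] for c in log_likelihoods])

# Hypothetical per-class values for one segment.
log_lik = {"aa": -10.0, "iy": -12.0, "s": -20.0}
log_prior = {"aa": math.log(0.3), "iy": math.log(0.3), "s": math.log(0.4)}

log_marginal = segment_log_marginal(log_lik, log_prior)

# Posterior of class "aa" derived from the SAME densities (an assumption of this sketch).
log_posterior_aa = log_lik["aa"] + log_prior["aa"] - log_marginal

cir_log_score = log_posterior_aa + log_marginal          # classification x segmentation
traditional_log_score = log_lik["aa"] + log_prior["aa"]  # joint term, as in (2)
print(cir_log_score, traditional_log_score)
```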

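Another method for computing the segmentation probability, similar to that presented in [2], is to find the posterior probability $p(S \mid X)$. In this approach, we use distributions that model presence versus absence of a segment boundary at each frame, based on local features. The segmentation probability is written as

$$p(S \mid X) = \prod_i p(s_i \mid X(s_i)) \tag{4}$$

and the probability of an individual segment of length $L$ is

$$p(s_i \mid X(s_i)) = p(b_L \mid X(s_i)) \prod_{j=1}^{L-1} p(\bar{b}_j \mid X(s_i)), \tag{5}$$

where $b_L$ is the event that there is a boundary after frame $L$ and $\bar{b}_j$ is the event that there is not a boundary after the $j$th frame of the segment. We estimate the frame boundary probabilities as

[equation illegible in the source]

where $K$ is the ratio of the prior probabilities $p(b)$ and $p(\bar{b})$, and $\mathcal{L}_j$ is the ratio of the likelihoods $p(x_j, x_{j+1} \mid b_j)$ and $p(x_j, x_{j+1} \mid \bar{b}_j)$ of the two frames adjoining the $j$th frame boundary. (The source does not legibly show which term is the numerator in either ratio.)

The component conditional probabilities are computed as

$$p(x_j, x_{j+1} \mid \bar{b}_j) = \sum_{\beta} p(x_j, x_{j+1} \mid \beta)\, p(\beta) \tag{6}$$

and

$$p(x_j, x_{j+1} \mid b_j) = \sum_{\beta_1} \sum_{\beta_2} p(x_j \mid \beta_1)\, p(x_{j+1} \mid \beta_2)\, p(\beta_1, \beta_2), \tag{7}$$

where $\beta$ ranges over the manner-of-articulation phoneme classes: stops, nasals, fricatives, liquids, vowels, and additionally, silence.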
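The following sketch shows one standard way to turn a prior ratio and a likelihood ratio into a boundary posterior, and then to score a segment as in (5). The posterior form used here, K·L/(1 + K·L) with K = p(b)/p(not b) and L = p(x | b)/p(x | not b), is Bayes' rule and is an assumption of this sketch; it is not necessarily the estimator used in the experiments. The function names and numeric values are hypothetical.

```python
import math

def boundary_posterior(prior_odds, likelihood_ratio):
    """Assumed form: p(b | x) = K*L / (1 + K*L),
    with K = p(b) / p(not b) and L = p(x | b) / p(x | not b)."""
    kl = prior_odds * likelihood_ratio
    return kl / (1.0 + kl)

def segment_log_prob(likelihood_ratios, prior_odds):
    """Equation (5) in the log domain for a segment of L frames.

    likelihood_ratios[j] is the ratio for the frame boundary after frame j+1;
    the last entry corresponds to the segment-final boundary b_L.
    """
    *interior, last = likelihood_ratios
    log_p = math.log(boundary_posterior(prior_odds, last))
    for ratio in interior:
        # Interior frames contribute the probability of NO boundary.
        log_p += math.log(1.0 - boundary_posterior(prior_odds, ratio))
    return log_p

# Hypothetical three-frame segment: weak evidence inside, strong evidence at the end.
ratios = [0.1, 0.2, 8.0]
print(segment_log_prob(ratios, prior_odds=0.25))
```

With neutral evidence (K = 1 and every ratio equal to 1) each factor is 0.5, so a three-frame segment scores 3·log(0.5), which is a convenient sanity check.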

The two segmentation models presented have different advantages. The first method makes use of the complete set of SSM phone models in determining likely boundaries for each segment and hence may have a more complete model of the speech process. On the other hand, the second approach uses models explicitly trained to differentiate between boundary and non-boundary acoustic events. The best choice of segmentation score is an empirical question that we have begun to address in this work.

3.  EXPERIMENTS 

Experiments have been conducted to determine the feasibility of the recognition approach described here. First, we wished to determine whether fixed-length measurements could be as effective in recognition as using the complete observation sequence, as is normally done in other SSM work and in HMMs. This test would tell whether the segmentation score can compensate for the use of fixed-length measurements. Second, we investigated the comparative performance of the two segmentation scoring mechanisms outlined in the previous section.

3.1.  CIR  Feasibility 

The feasibility of fixed-length measurements was investigated first in a phoneme classification framework. Since we planned to eventually test our algorithms in word recognition on the Resource Management (RM) database, our phone classification experiments were also run on this database. Since the RM database is not phonetically labeled, we used an automatic labeling scheme to determine the reference phoneme sequence and segmentation for each sentence in the database. The labeler, a context-dependent SSM, took the correct orthographic transcription, a pronunciation dictionary, and the speech for a sentence and used a dynamic programming algorithm to find the best phonetic alignment. The procedure used an initial labeling produced by the BBN BYBLOS system [8] as a guide, but allowed some variation in pronunciations, according to the dictionary, as well as in segmentation. The resulting alignment is flawed in comparison with carefully hand transcribed speech, as in the TIMIT database. However, our experience has shown that using comparable models and analysis, there is only about a 4-6% loss in classification performance (e.g., from 72% to 68% correct for context-independent models) between the two databases, and the RM labeling is adequate for making preliminary comparisons of classification algorithms. The final test of any classification algorithm is made under the CIR formalism in word recognition experiments, for which the RM database is well suited.

In classification, the observation vectors in each segment were linearly sampled to obtain a fixed number of vectors per segment, m = 5 frames. For observed segments of length less than five frames, the transformation repeated some vectors more than once. The feature vector for each frame consisted of 14 Mel-warped cepstral coefficients and their first differences as well as differenced energy. The m distributions of each segment were modeled as independent full covariance Gaussian distributions. Separate models were trained for males and females by iteratively segmenting and estimating the models using the algorithm described in [4]. The testing material came from the standard "Feb89" and "Oct89" test sets. In classification experiments using the Feb89 test set, the percent correct is reported over the complete set of phoneme instances, 11752 for our transcription. Several simplifying assumptions were made to facilitate implementation. Only context-independent models were estimated, and the labels and segments of the observation sequence were considered independent.
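The linear sampling step can be sketched as follows. The exact sampling rule is not specified in the text, so the index formula below (the centre of each of m equal-width bins, mapped to a frame index) is an assumption chosen because it repeats frames for short segments and subsamples long ones; `linear_sample` is an illustrative name.

```python
def linear_sample(frames, m=5):
    """Map a variable-length list of frames to exactly m frames.

    Assumed rule: output position k takes the frame whose index is
    floor((2k + 1) * T / (2m)), i.e. the centre of the k-th of m equal bins.
    Segments shorter than m repeat some frames; longer ones skip some.
    """
    num_frames = len(frames)
    if num_frames == 0:
        raise ValueError("segment must contain at least one frame")
    indices = [((2 * k + 1) * num_frames) // (2 * m) for k in range(m)]
    return [frames[i] for i in indices]

# Frames are represented by their indices so the mapping is easy to read.
print(linear_sample(list(range(3))))   # short segment: some frames repeated
print(linear_sample(list(range(5))))   # exactly m frames: unchanged
print(linear_sample(list(range(10))))  # long segment: subsampled
```

Each of the m sampled vectors would then be scored by its own Gaussian, consistent with the independence assumption across the m distributions described above.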

On the Feb89 test set the classification results were 65.8% correct when the entire observation sequence was used and 66.4% correct when a fixed number of observations was used for each segment. This result indicates that, in classification, using fixed length measurements can work as well as using the entire observation.

Having verified that fixed-length features are useful in classification, the next step was to evaluate their use in recognition with the CIR formalism. In recognition, we make use of the N-best formalism. Although originally developed as an interface between the speech and natural language components of a spoken language system [9], this mechanism can also be used to rescore hypotheses with a variety of knowledge sources [10]. Each knowledge source produces its own score for every hypothesis, and the decision as to the most likely hypothesis is determined according to a weighted combination of scores from all knowledge sources. The algorithm reduces the search of more computationally expensive models, like the SSM, by eliminating very unlikely sentences in the first pass, performed with a less expensive model, such as the HMM. In this work, the BBN BYBLOS system [8] is used to generate 20 hypotheses per sentence.


Using the N-best formalism, an experiment was run comparing the CIR recognizer to an SSM recognizer that uses all observations. The classifier for the CIR system was the same as that used in the previous experiment. The joint probability of segmentation and observations, $p(X, S)$, was computed as in Equation (3), using a version of the SSM that considered the complete observation sequence for a segment. That is, not just m, but all observation vectors in a segment were mapped to the distributions and used in finding the score. The weights for combining scores in the N-best formalism were trained on the Feb89 test set. In this case the scores to be combined were simply the SSM score, the number of words and the number of phonemes in a sentence.
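N-best rescoring with a weighted combination of scores can be sketched as below. The three score names follow the text (SSM score, number of words, number of phonemes), but the hypotheses, score values, and weights are hypothetical, and `rescore` is an illustrative helper rather than the BYBLOS interface.

```python
def combined_score(hypothesis, weights):
    """Weighted sum of the knowledge-source scores attached to one hypothesis."""
    return sum(weights[name] * hypothesis["scores"][name] for name in weights)

def rescore(hypotheses, weights):
    """Return the hypothesis with the highest weighted combination of scores."""
    return max(hypotheses, key=lambda h: combined_score(h, weights))

# Hypothetical two-entry N-best list (the experiments use 20 hypotheses per sentence).
nbest = [
    {"words": "show all ships",
     "scores": {"ssm": -120.0, "num_words": 3, "num_phonemes": 10}},
    {"words": "show tall ships",
     "scores": {"ssm": -118.5, "num_words": 3, "num_phonemes": 11}},
]

# Hypothetical weights; in the experiments they are trained on the Feb89 test set.
weights = {"ssm": 1.0, "num_words": 0.5, "num_phonemes": -2.0}

best = rescore(nbest, weights)
print(best["words"], combined_score(best, weights))
```

The word and phoneme counts act as length corrections, so a hypothesis with a slightly better acoustic score can still lose if it requires more segments.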

In evaluating performance using the N-best formalism, the percent word error is computed from the highest-ranked of the rescored hypotheses. On the Feb89 test set the word error for both the classification-in-recognition method and the original recognition approach was 9.1%. To determine if these results were biased due to training the weights for combining scores on the same test data, this experiment was repeated on the Oct89 test set using the weights developed on the Feb89 test set. The performance for the CIR recognizer was 9.4% word error (252 errors in a set of 2684 reference words) and the performance for the original approach using the complete observation sequence was 9.1% word error (244 errors). The performance of the new recognition formalism is thus very close to that of the original scheme, and in fact the difference between them could be attributed to differences associated with suboptimal N-best weight estimation techniques [11].
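The Oct89 percentages follow directly from the error counts quoted above; the short check below reproduces the arithmetic (`word_error_percent` is an illustrative name).

```python
def word_error_percent(num_errors, num_reference_words):
    """Word error rate as a percentage of the reference word count."""
    return 100.0 * num_errors / num_reference_words

REFERENCE_WORDS = 2684  # Oct89 reference word count quoted in the text

cir_error = word_error_percent(252, REFERENCE_WORDS)
original_error = word_error_percent(244, REFERENCE_WORDS)

print(round(cir_error, 1), round(original_error, 1))
print(252 - 244)  # absolute difference in word errors between the two systems
```

The two systems differ by 8 word errors out of 2684 reference words.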

3.2.  Segmentation  Score 

As mentioned previously, some current systems use a classification scheme with no explicit probability of segmentation. We attempted to simulate this effect with the classification recognizer by simply suppressing the score for the joint probability of segmentation and observations. This is equivalent to assuming that the segmentation probabilities are equally likely for all hypotheses considered. Scores were computed for the utterance with and without the $p(X, S)$ term on the Feb89 test set. When just the classification scores were used, word error went from 9.1% to 10.8%, an 18% degradation in performance. Apparently, the joint probability of segmentation and observations has a significant effect in normalizing the posterior probability for better recognition.

Experiments were also run to compare the two methods of segmentation scoring described above. In the first method, based on equation (3), the same analysis described earlier was used at each frame (cepstra plus differenced cepstra and differenced energy) and the summation was over the set of context independent phones. In the second method, which computes $p(S \mid X)$ using equations (4)-(7), we modeled each of the conditional densities in (6) and (7) as the joint, full covariance, Gaussian distribution of the cepstral parameters of the two frames adjoining the hypothesized boundary. In order to reduce the number of free parameters to estimate in the Gaussian model, we used only the cepstral coefficients as features for each frame. On the Feb89 test set the first method had 9.1% combined word error for male and female speakers, while the second method had 11.0% word error. Using the best weights for the N-best combination from this test set, the segmentation algorithms were also run on the Oct89 test set. In this case, the word error rates for the two methods were 9.4% and 11.9%, respectively.

This result suggests that the boundary-based segmentation score yields performance that is worse than no segmentation score. However, the "no segmentation" case actually uses an implicit segmentation score in that the N hypotheses are assumed to have equally likely segmentations (while all other segmentations have probability zero) and in that phoneme and word counts are used in the combined score. Although we suspect that the marginal distribution model for segmentation scores may still be preferable, clearly more experiments are needed with a larger number of sentence hypotheses to better understand the characteristics of the different approaches.

4.  DISCUSSION 

In summary, we have described an alternative approach to speech recognition that combines classification and segmentation scoring to more effectively use segmental features. Our pilot experiments demonstrate that the classification-in-recognition approach can achieve performance comparable to the traditional formalism when frame-based features and equivalent Gaussian distributions are used, and that the segmentation score can be an important component of a classification approach. We anticipate performance gains with the additional use of segmental features in the classification component of the CIR model. We also plan to extend the model to incorporate context-dependent units.

Our initial experiments with the segmentation probability indicate that finding this component via marginal probabilities computed with a detailed model may be more accurate than estimating boundary likelihood based on local observations, although this conclusion should be verified with experiments using a larger number of hypotheses per sentence than the 20 used so far. A number of improvements can be made to both models, including using different choices for mixture components and eliminating some of the independence assumptions. Additionally, in the second method we plan to increase both the number of features per frame and the number of boundary-adjacent frames considered in computing the boundary probabilities. Eventually a hybrid method that combines elements of both approaches may prove to be the most effective.